
quickwit: add tag_fields on CounterID, drop positions on raw text #877

Closed

alexey-milovidov wants to merge 48 commits into add-quickwit-entry from quickwit-tag-fields-record-basic


Conversation

@alexey-milovidov
Member

Summary

Two index-level changes to quickwit/index_config.yaml, keeping the rest of the benchmark setup identical.

  • tag_fields: [CounterID] — Q37-Q43 all filter CounterID = 62. Tagging it writes the per-split CounterID values into the metastore so the searcher can prune whole splits before opening them. This is the closest analogue we get to Elasticsearch's index.sort early-termination on the same column. Quickwit/Tantivy has no real multi-column doc-sort to match the full ES sort.field: [CounterID, EventDate, UserID, EventTime, WatchID], so this picks up just the CounterID dimension.
  • record: basic on every tokenizer: raw text field (28 fields). Tantivy defaults text postings to WithFreqsAndPositions, but raw-tokenized fields only ever hold one term per document — phrase queries can't run against them, so freqs and positions are dead weight in the index.
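For reference, here is a minimal sketch of how the two settings slot into quickwit/index_config.yaml, using Quickwit's documented doc_mapping syntax. The field names and types below are illustrative placeholders rather than the benchmark's full mapping:

```yaml
doc_mapping:
  field_mappings:
    - name: CounterID
      type: i64               # placeholder type, not necessarily the real mapping's
    - name: Title             # stands in for each of the 28 tokenizer: raw fields
      type: text
      tokenizer: raw
      record: basic           # single-term postings keep neither freqs nor positions
  tag_fields: [CounterID]     # per-split values go to the metastore for split pruning
```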

Validated against the running v0.9.0-nightly server (the same image benchmark.sh uses): the tag_fields and record: basic settings round-trip cleanly through the index-create API.

Test plan

  • Re-run bash benchmark.sh end-to-end on a fresh machine
  • Compare cold + warm timings against the previous results, especially Q37–Q43 (CounterID filter) for the tag_fields benefit
  • Confirm load time and on-disk size — both should stay flat or shrink slightly thanks to record: basic

🤖 Generated with Claude Code

alexey-milovidov and others added 30 commits May 8, 2026 20:11
Some historical clickhouse-cloud entries stored cluster_size as a
JSON string ("1", "2", "3") while modern ones use plain integers
(1, 2, 3). The dashboard treats the two representations as distinct
values and renders cluster_size 2 and 3 twice in selectors. Convert
all string-numeric cluster_size values to integers across the repo.
Non-numeric strings (serverless, dedicated) are left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
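A sketch of that rule as a hypothetical TypeScript helper (the repo edit itself was a one-off conversion of the JSON files, not this function):

```typescript
// String-numeric cluster_size values become integers; everything else,
// including "serverless" and "dedicated", passes through untouched.
function normalizeClusterSize(value: unknown): unknown {
  if (typeof value === "string" && /^\d+$/.test(value)) {
    return parseInt(value, 10); // "2" -> 2, so selectors stop showing duplicates
  }
  return value;
}
```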
generate-results.sh used to take, for every (system, basename) pair,
the latest dated copy across all date subdirectories. That meant the
website kept surfacing a benchmark machine even after the system was
re-run on a newer date whose results no longer cover that machine.

Switch the rule: for each <system>/results/, find the lexicographically
greatest YYYYMMDD subdirectory and emit every file it contains. Older
subdirs remain in the repo as history but are not rendered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use only the latest date subdir of each system for the dashboard
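The new rule, re-expressed as a hypothetical TypeScript helper (the real logic lives in generate-results.sh):

```typescript
// dirs: the YYYYMMDD subdirectory names under <system>/results/.
// Returns the directory whose files should be rendered, or undefined
// if the system has no dated subdirectories.
function latestResultsDir(dirs: string[]): string | undefined {
  return dirs
    .filter((d) => /^\d{8}$/.test(d)) // keep only YYYYMMDD names
    .sort()                           // lexicographic order == chronological here
    .pop();                           // greatest date wins; every file in it is emitted
}
```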
Revert #874: restore previous generate-results.sh behavior
When all selected systems return null for a query, the per-query baseline
becomes Math.min() over an empty set (Infinity), which makes
log(curr/Infinity) = -Infinity and collapses every system's geometric
mean to 0: bars render with width 0 and the chart appears empty.

Reproduction: filter to Elasticsearch + Quickwit (Q28's REGEXP_REPLACE
fails on both).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Skip queries that fail on every filtered system
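A sketch of the failure mode and the guard, with hypothetical names (the dashboard's actual code is organized differently):

```typescript
// times[s][q] is system s's time for query q, or null if the query failed
// on that system. Math.min() over zero arguments is Infinity, so an
// all-null query column used to contribute log(curr / Infinity) = -Infinity
// and drag the whole geometric mean to 0.
function relativeGeomean(times: Array<Array<number | null>>, s: number): number {
  let logSum = 0;
  let counted = 0;
  for (let q = 0; q < times[s].length; q++) {
    const col = times.map((row) => row[q]).filter((t): t is number => t !== null);
    if (col.length === 0) continue;    // the fix: skip queries that fail everywhere
    const curr = times[s][q];
    if (curr === null) continue;       // this system failed; penalized elsewhere
    const baseline = Math.min(...col); // was Infinity before the guard above
    logSum += Math.log(curr / baseline);
    counted++;
  }
  return Math.exp(logSum / counted);
}
```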
The c8g.metal-48xl run was committed in 2b124ba ("Update
clickhouse-datalake-partitioned results", authored 2026-02-18) with
the date field accidentally set to "2027-02-18". The restructure in
bb91b0c then put it under results/20270218/ — making it the
lexicographically-latest dir despite containing a single machine,
which masked the real latest dir (20260506).

Move the file to results/20260218/ alongside the other 2026-02-18
results from the same commit, and correct the date field.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identify obsolete results by comparing each system's older-dated
result files against the canonical pre-refactor flat layout
(`bb91b0cf5~1`, the commit just before the date-subdir restructure).
Any older-dated file whose basename was not in that flat layout
represents a machine/configuration that had already been deleted from
the canonical state — mark those `"historical"` so the dashboard
doesn't surface them.

This catches:
- Old ClickHouse Cloud configurations (Dedicated, colder-cache and
  parallel-replicas experiments, retired size tiers 40 / 56 / 80 /
  128 / 240 GiB).
- Old ClickHouse hardware (c5.4xlarge, m5d.24xlarge, m6i.32xlarge,
  *.zstd, *.tuned, *.tuned.memory, c5n.4xlarge for clickhouse-web).
- Per-system retired runs (DataFusion `f16s_v2` and old
  `single.json`, MotherDuck `result.json`/`result_*`/`pulse`/
  `standard`, polars and polars-dataframe retired filenames,
  starrocks `*.untuned`, paradedb 1500GB, hydra `c6a.4xlarge` (now
  on `hydra.json`), etc).

Also drops tags from older results that aren't in the union of tags
in the system's latest dated subdir, except `"historical"` (catches
residuals like `"analytical"`, `"MySQL compatible"` in old databend,
`"Python"` in arc, `"open-source"`/`"dataframe"`/`"parquet"` in
polars, etc — applying the rules from earlier tag-removal commits
d661b49 / 46a535b / fb09092 / ae85f0d / 0aab48e to the historical
copies that still carried the deprecated tags).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the manual arch-detection + zip download with the official
installer at install.gizmosql.com, mirroring the pattern DuckDB uses
in this repo. The installer handles arch/OS detection and installs
to ~/.local/bin by default, which we then prepend to PATH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…3a.small

- datafusion/results/<YYYYMMDD>/single.json renamed to c6a.4xlarge.json
  to match the per-machine naming used everywhere else; the historical
  tag is removed from those files since they no longer represent an
  obsolete basename.
- datafusion/results/20250522/single.json deleted as redundant —
  c6a.4xlarge.json already exists in the same dir with identical
  metadata and identical numeric results (the only diffs are
  trailing-zero formatting).
- duckdb-vortex/results/20250521/c6a.4xlarge-single.json deleted for
  the same reason — same date / system / machine / metadata as the
  canonical c6a.4xlarge.json next to it.
- firebolt-parquet{,-partitioned}/results/20260221/t3a.small.json
  removed entirely; those entries were incorrect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both systems had genuine standalone runs on AWS hardware that were
incorrectly tagged "historical" by the pre-refactor flat-layout
heuristic — the flat layout only kept the most recent canonical
machine per system, so older one-off machines looked obsolete even
though the run is still meaningful as a historical comparison point.

- glaredb/results/20240202/c6a.metal.json — drop historical
- hydra/results/{20221209,20230919}/c6a.4xlarge.json — drop historical

Also delete glaredb/results/20250525/c6a.4xlarge-parquet-single.json
as redundant (same date / system / machine / metadata as the canonical
c6a.4xlarge.json next to it; numerical results identical, only
trailing-zero formatting differs — same situation as the
datafusion/duckdb-vortex *-single duplicates removed in the previous
commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- motherduck/results/{20240127,20241029}/result.json renamed to
  result_standard.json. The runs were originally machine="cloud"
  (back when Motherduck only offered one tier); update machine to
  "Motherduck: standard" to match current naming and drop the
  historical tag.
- paradedb/results/20240202/c6a.4xlarge.1500gb.json deleted —
  identical results to c6a.4xlarge.json next to it; the .1500gb
  filename was a one-off clarification for an Elasticsearch comparison
  per its comment field. The canonical c6a.4xlarge.json carries the
  same numbers without that side-comment.
- paradedb/results/20240713/single.json deleted — same date / system /
  machine / load_time / data_size as the canonical c6a.4xlarge.json
  next to it; results differ only by tiny numerical noise (<= 0.001s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- polars/results/20241129/DataFrame_c6a.metal.json moved to
  polars-dataframe/results/20241129/c6a.metal.json (the run is
  system="Polars (DataFrame)", so it belongs in polars-dataframe).
- polars/results/{20241129,20241215}/parquet_c6a.metal.json /
  parquet_c6a.4xlarge.json renamed to drop the parquet_ prefix
  (parquet is the default encoding for polars/, so the prefix is
  redundant — polars-dataframe/ is the dataframe variant).
- Historical tag dropped from all three renamed files.

The pre-existing canonical c6a.metal.json / c6a.4xlarge.json in those
date dirs were re-runs that ended up there because their date field
wasn't updated when the data was refreshed in commit 69d3e50;
the renamed files carry the actual 2024-11-29 / 2024-12-15 numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- starrocks/results/{20220715,20220925}/*.untuned.json — old untuned
  variant from when both tuned and untuned runs were captured
  separately. The canonical c6a.4xlarge.json / c6a.metal.json next
  to them already record an untuned run (tuned="no") with the
  modern schema.
- timescaledb/results/20220701/c6a.4xlarge.compression.json — old
  compression-on variant; the canonical c6a.4xlarge.json carries
  the standard TimescaleDB run for that date.
- trino{,-partitioned}/results/202605{06,07}/c8g*.json — c8g runs
  removed entirely (per maintainer instruction).
- umbra/results/20251026/c6a.{2xlarge,xlarge}.json — incorrect
  results, removed entirely.
- arc/results/2025*/m3_max*.json — m3_max runs removed entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…large,metal-48xl}

- starrocks/results/{20220715,20220925}/{c6a.4xlarge,c6a.metal}.json
  replaced with the content of the previously-deleted *.untuned.json
  files. The untuned numbers are the right canonical record for those
  dates (the prior "tuned" canonical was a parallel run that wasn't
  the one used to establish the historical entry). Drops the
  "historical" tag and the "ClickHouse derivative" tag (not in latest
  starrocks tag set), keeps system="StarRocks".
- trino{,-partitioned}/results/20260507/c8g.4xlarge.json and
  c8g.metal-48xl.json restored. Per maintainer note, only
  c8g.24xlarge.json was supposed to be removed; the other two c8g
  variants stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mark stale results historical to clean up the dashboard
motherduck/ uses lowercase tier names ("Motherduck: jumbo",
"Motherduck: mega", "Motherduck: standard"); pg_duckdb-motherduck/
had three files with "Motherduck: Jumbo" (capital J). Lower-case the
J so the dashboard groups all jumbo-tier runs under one machine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rt sizes

For cloud-service results whose .machine value contains a memory size
(GB / GiB) or a T-shirt size (XS / S / M / L / XL / nXL, etc.), drop
the redundant cloud-name prefix so the dashboard groups runs by the
actual size rather than the (system, machine) tuple. The system field
on each entry already carries the cloud name; repeating it inside
.machine just bloats the X axis.

Also normalize T-shirt sizing variants to their letter form:
  Small → S, Medium → M, Large → L,
  X-Small → XS, X-Large → XL,
  2X-Small → 2XS, 2X-Large → 2XL, 3X-Large → 3XL, 4X-Large → 4XL,
  5X-Large → 5XL.

Affected systems: AlloyDB, ByteHouse, CHYT, ClickHouse Cloud
(every aws/azure/gcp tier), CrunchyBridge, Databricks, Hydra,
Snowflake, Supabase, Tablespace, Timescale Cloud, pgpro_tam.

Bare-metal hardware descriptions (CPU model + RAM, "AWS c5.metal
100GB", etc) are left unchanged — the rule applies to managed-cloud
machine labels only.

Aurora's "16acu", Hologres' "16 CU", Redshift's "ra3.4xlarge", and
SingleStore's "S2"/"S24" don't match the GB or T-shirt-size pattern
and are also left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
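Both rules, sketched as a hypothetical TypeScript helper (the actual changes were applied directly to the result JSON files; `cloud` stands for the cloud name repeated in the label and is assumed to contain no regex metacharacters):

```typescript
// Longest variants first, so "2X-Large" is rewritten before "X-Large"
// or "Large" can match inside it.
const TSHIRT: Array<[RegExp, string]> = [
  [/\b2X-Small\b/, "2XS"], [/\b2X-Large\b/, "2XL"], [/\b3X-Large\b/, "3XL"],
  [/\b4X-Large\b/, "4XL"], [/\b5X-Large\b/, "5XL"],
  [/\bX-Small\b/, "XS"], [/\bX-Large\b/, "XL"],
  [/\bSmall\b/, "S"], [/\bMedium\b/, "M"], [/\bLarge\b/, "L"],
];

function normalizeMachine(machine: string, cloud: string): string {
  let m = machine;
  for (const [pattern, letter] of TSHIRT) m = m.replace(pattern, letter);
  // Drop the cloud-name prefix only when the remainder is a memory size or a
  // T-shirt size; labels like "16acu", "16 CU", "ra3.4xlarge", and "S2" fail
  // both tests and stay whole.
  const rest = m.replace(new RegExp(`^${cloud}[\\s:,-]+`, "i"), "");
  if (rest !== m && (/\d+\s?Gi?B/.test(rest) || /^(\d?X?[SL]|M)\b/.test(rest))) {
    m = rest;
  }
  return m;
}
```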
Convert "<digits><space?>GB" → "<digits>GiB" in cloud-service machine
names. Where the value also carries an "<N> vCPU " prefix in front
of the GB amount, drop that prefix — the GiB tier already conveys
the size, so "8 vCPU 64 GB" simplifies to "64GiB".

Examples:
- "8 vCPU 64 GB" (AlloyDB) → "64GiB"
- "10 vCPU 40GB" (CHYT) → "40GiB"
- "720GB" (CHYT) → "720GiB"
- "Analytics-256GB" (Crunchy Bridge) → "Analytics-256GiB"
- "L1 - 16CPU 32GB" (Tablespace) → "L1 - 16CPU 32GiB"
  (16CPU is not "vCPU" so it stays)
- "8 vCPU 32GB" (Timescale ☁️) → "32GiB"
- "16 vCPU 32GB" / "30 vCPU 480GB" (pgpro_tam) → "32GiB" / "480GiB"
- "64 vCPU 256GB" (YDB) → "256GiB"

Bare-metal hardware descriptions in hardware/, versions/, gravitons/
(e.g. "AWS c5.metal 100GB", "Linode 16GB", "Steam Deck 512 GB",
"AMD EPYC 3.2 GHz, Micron 5100 MAX 960 GB") are left alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
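The conversion, sketched as a hypothetical TypeScript helper following the rules above:

```typescript
function normalizeGiB(machine: string): string {
  // Drop an "<N> vCPU " prefix only when it directly precedes the GB amount;
  // the GiB tier already conveys the size ("8 vCPU 64 GB" -> "64GiB").
  const m = machine.replace(/^\d+\s*vCPU\s+(?=\d+\s?GB\b)/, "");
  // "<digits><optional space>GB" -> "<digits>GiB". "16CPU" is not "vCPU",
  // so "L1 - 16CPU 32GB" keeps its prefix and becomes "L1 - 16CPU 32GiB".
  return m.replace(/(\d+)\s?GB\b/, "$1GiB");
}
```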
Normalize machine names: drop redundant cloud prefix, normalize T-shirt sizes
github and others added 16 commits May 8, 2026 20:05
The c7i.metal-48xl runs in chdb / chdb-dataframe /
chdb-parquet-partitioned were one-off captures that aren't part of
the canonical machine set (the canonical chdb suite uses c6a / c7a /
c8g variants). Tag them "historical" so they stop appearing on the
dashboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"Analytics-256GiB" → "256GiB". The system field already says
"Crunchy Bridge (Parquet)", so the "Analytics-" prefix is redundant
once the cloud-name has been dropped from the machine label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Keep only the RAM size as the machine label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GizmoSQL: use the official one-line installer
@CLAassistant

CLAassistant commented May 9, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
3 out of 4 committers have signed the CLA.

✅ alexey-milovidov
✅ rschu1ze
✅ prmoore77
❌ github


github seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

alexey-milovidov and others added 2 commits May 9, 2026 11:39
tag_fields: [CounterID] writes per-split CounterID values into the
metastore so the searcher can prune whole splits before opening them
for queries 37-43, which all filter CounterID = 62 — the closest
analogue to Elasticsearch's index.sort early-termination here.

record: basic on every tokenizer: raw text field skips storing freqs
and positions in the postings; phrase queries can never run against
single-term raw fields, so the data was dead weight.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-milovidov force-pushed the quickwit-tag-fields-record-basic branch from 76a4092 to 5bb3d7e on May 9, 2026 at 11:39